Editing Rules: Discovery and Application to Data Cleaning
نویسندگان
چکیده
Dirty data is a serious problem for businesses, leading to incorrect decision making, inefficient daily operations, and ultimately wasting both time and money. A variety of integrity constraints like Conditional Functional Dependencies (CFD) have been studied for data cleaning. Data repairing methods based on these constraints are strong to detect inconsistencies but are limited on how to correct data, worse they can even introduce new errors. Based on Master Data Management principles, a new class of data quality rules known as Editing Rules (eR) tells how to fix errors, pointing which attributes are wrong and what values they should take. However, finding data quality rules is an expensive process that involves intensive manual efforts. In this paper, we develop pattern mining technics for discovering eRs from existing source relations (eventually dirty) with respect to master relations (supposed to be clean and accurate). In this setting, we propose a new semantic of eRs taking advantage of both source and master data. The problem turns out to be strongly related to the discovery of both CFD and one-to-one correspondences between sources and target attributes. We have proposed efficient technics to address these two subproblems. We have implemented and evaluated our technics on real-life databases. Experiments show both the feasibility, the scalability and the robustness of our proposition.
منابع مشابه
Discovering Editing Rules For Data Cleaning
Dirty data continues to be an important issue for companies. The database community pays a particular attention to this subject. A variety of integrity constraints like Conditional Functional Dependencies (CFD) have been studied for data cleaning. Data repair methods based on these constraints are strong to detect inconsistencies but are limited on how to correct data, worse they can even intro...
متن کاملRègles d’Edition: Fouille et Application au Nettoyage de Données
Dirty data is a serious problem for businesses, leading to incorrect decision making, inefficient daily operations, and ultimately wasting both time and money. A variety of integrity constraints like Conditional Functional Dependencies (CFD) have been studied for data cleaning. Data repairing methods based on these constraints are strong to detect inconsistencies but are limited on how to corre...
متن کاملResearch of Data Cleaning Methods Based on Dependency Rules
This paper introduces the concept and principle of data cleaning, analyzes the types and causes of dirty data, and proposes several key steps of typical cleaning process, puts forward a well scalability and versatility data cleaning framework, in view of data with attribute dependency relation, designs several of violation data discovery algorithms by formal formula, which can obtain inconsiste...
متن کاملDiscovering Data Quality Rules in a Master Data Management
Dirty data continues to be an important issue for companies. The datawarehouse institute [Eckerson, 2002], [Rockwell, 2012] stated poor data costs US businesses $611 billion dollars annually and erroneously priced data in retail databases costs US customers $2.5 billion each year. Data quality becomes more and more critical. The database community pays a particular attention to this subject whe...
متن کاملData Cleaning using Probabilistic Models of Integrity Constraints
In data cleaning, data quality rules provide a valuable tool for enforcing the correct application of semantics on a dataset. Traditional rule discovery techniques assume a reasonably clean dataset, and fail when faced with a dirty one. Enforcement of these rules for error detection is much less effective when mined on dirty data. In the databases literature, a popular and expressive type of lo...
متن کامل